Red & White Wine Quality by Di Wang

Objective

In this project, I will explore a data set on wine quality and the corresponding chemical contents.

Univariate Plots Section

First, let’s run some basic functions to examine the structure and schema of the data.

Number of Observations and variables:

## [1] 6497   14

Field names:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "color"

Structure of the data and summary:

## 'data.frame':    6497 obs. of  14 variables:
##  $ X                   : int  4898 4895 4894 4893 4892 4891 4889 4883 4882 4880 ...
##  $ fixed.acidity       : num  6 6.6 6.2 6.5 5.7 6.1 6.8 5.5 5 6.6 ...
##  $ volatile.acidity    : num  0.21 0.32 0.21 0.23 0.21 0.34 0.22 0.32 0.235 0.34 ...
##  $ citric.acid         : num  0.38 0.36 0.29 0.38 0.32 0.29 0.36 0.13 0.27 0.4 ...
##  $ residual.sugar      : num  0.8 8 1.6 1.3 0.9 ...
##  $ chlorides           : num  0.02 0.047 0.039 0.032 0.038 0.036 0.052 0.037 0.03 0.046 ...
##  $ free.sulfur.dioxide : num  22 57 24 29 38 25 38 45 34 68 ...
##  $ total.sulfur.dioxide: num  98 168 92 112 121 100 127 156 118 170 ...
##  $ density             : num  0.989 0.995 0.991 0.993 0.991 ...
##  $ pH                  : num  3.26 3.15 3.27 3.29 3.24 3.06 3.04 3.26 3.07 3.15 ...
##  $ sulphates           : num  0.32 0.46 0.5 0.54 0.46 0.44 0.54 0.38 0.5 0.5 ...
##  $ alcohol             : num  11.8 9.6 11.2 9.7 10.6 ...
##  $ quality             : int  6 5 6 5 6 6 5 5 6 6 ...
##  $ color               : chr  "White" "White" "White" "White" ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality         color          
##  Min.   : 8.00   Min.   :3.000   Length:6497       
##  1st Qu.: 9.50   1st Qu.:5.000   Class :character  
##  Median :10.30   Median :6.000   Mode  :character  
##  Mean   :10.49   Mean   :5.818                     
##  3rd Qu.:11.30   3rd Qu.:6.000                     
##  Max.   :14.90   Max.   :9.000
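Output of this kind is what base R's dim(), names(), str() and summary() print. A minimal sketch on a tiny made-up data frame (the name `wine_demo` and all values are illustrative assumptions, not the real data set):

```r
# Tiny made-up stand-in for the wine data; values are illustrative only.
wine_demo <- data.frame(
  alcohol = c(9.5, 10.3, 11.8, 12.4),
  quality = c(5L, 6L, 6L, 7L),
  color   = c("Red", "White", "White", "Red"),
  stringsAsFactors = FALSE
)

dim(wine_demo)      # number of observations and variables
names(wine_demo)    # field names
str(wine_demo)      # type and first values of each column
summary(wine_demo)  # min, quartiles, mean and max per numeric column
```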

The quality ratings follow a roughly normal, slightly skewed distribution. Most wines were rated 5 or 6. The lowest rating is 3 and the highest is 9.

We would like to plot each individual factor and look for its potential influence on wine quality.

First, let’s look at the alcohol:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

The minimum alcohol content in the sample is 8% and the maximum is 14.9%. The mean alcohol content is 10.49%. The distribution of alcohol is skewed, with the mean above the median (10.30%).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800

Unlike alcohol, residual sugar spans a wide range. According to the Wikipedia classification, dry wine has a sweetness below 4 g/L, medium dry 4-12 g/L, medium 12-45 g/L, and sweet above 45 g/L. We would like to look at its distribution.

Most samples are dry wine and only a barely visible portion is sweet wine.
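The classification above can be sketched in R with cut(). The sugar values below are made up for illustration, and how the exact boundary values (4, 12, 45) are assigned to a class is an assumption of this sketch, since the quoted ranges do not settle it:

```r
# Made-up residual sugar values in g/L, for illustration only.
residual_sugar <- c(0.6, 1.8, 3.0, 8.1, 20.0, 65.8)

# Class boundaries follow the classification quoted above.
sweetness <- cut(
  residual_sugar,
  breaks = c(0, 4, 12, 45, Inf),
  labels = c("Dry", "Medium Dry", "Medium", "Sweet"),
  right  = FALSE   # intervals are [0, 4), [4, 12), [12, 45), [45, Inf)
)

# Number of samples in each sweetness class.
table(sweetness)
```

With the real data, the same call on the residual.sugar column would give the count of samples per class.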

Let’s look at the acids group:

fixed.acidity is approximately normally distributed. volatile.acidity and citric.acid have skewed distributions.

Then we would like to see the chlorides and sulphates:

Both chlorides and sulphates have skewed distributions and significant outliers.

Based on their descriptions, the 11 chemical factors can be grouped as follows:

  1. Acids
  2. Sugar
  3. Alcohol
  4. Chlorides
  5. Sulphates

We will mainly examine these 5 groups and their relationship to quality.

Univariate Analysis

What is the structure of your dataset?

There are 6497 observations of 14 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality, color). Quality is an ordered, discrete categorical variable on a 0-10 scale, rated by at least 3 wine experts. The observed values range only from 3 to 9, with a mean of 5.818 and a median of 6. X is the index of the wine sample. Color is a created categorical variable. All other variables are quantitative measurements of the chemical content of the wine.

What is/are the main feature(s) of interest in your dataset?

The main interest is in the factors that affect the quality of red and white wine. I suspected that alcohol, residual.sugar and pH would affect the quality. The other point of interest is the difference between red and white wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From the description of the variables, it seems that fixed.acidity & volatile.acidity, free.sulfur.dioxide & total.sulfur.dioxide, and alcohol & density may be correlated pairs of variables.

Did you create any new variables from existing variables in the dataset?

Yes, ‘color’ is a newly created variable.
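The report does not show how this column was built. One plausible way, assuming the red and white samples started out as two separate tables, is to tag each table and stack them; the data frames `red_demo` and `white_demo` below are hypothetical stand-ins with made-up values:

```r
# Hypothetical stand-ins for separate red and white tables.
red_demo   <- data.frame(alcohol = c(9.4, 9.8),  quality = c(5L, 5L))
white_demo <- data.frame(alcohol = c(8.8, 10.1), quality = c(6L, 6L))

# Tag each table with its color, then stack them row-wise.
red_demo$color   <- "Red"
white_demo$color <- "White"
wine_demo <- rbind(red_demo, white_demo)

# Check how many samples of each color ended up in the combined table.
table(wine_demo$color)
```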

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Factors such as residual.sugar and free.sulfur.dioxide have significant outliers. However, considering the units used, the outliers are acceptable, and the data is already tidy.

Bivariate Plots Section

From the nature of the chemicals, let’s examine the correlation group by group. The first group is about acids, pH and quality:

The acids do not have a strong relationship with quality. Among them, volatile.acidity has the largest correlation in magnitude (R = -0.266). Surprisingly, volatile.acidity is negatively correlated with citric.acid, and pH (a logarithmic measure of acidity) is positively related to volatile.acidity. We will examine these relationships further.

Let’s then examine the second group of factors: residual sugar, alcohol, density, sulphates, chlorides and quality.

This result agrees with basic physics: sugar adds to density, while alcohol reduces it. Among these factors, alcohol and chlorides are the most important independent factors influencing quality (density can be seen as a dependent factor).

Let’s see the last group of data.

The last group shows rather weak relationships. The type/color does not seem to influence quality, and the three other factors are all weakly related to quality. The strongest relationship is between free.sulfur.dioxide and total.sulfur.dioxide, which is to be expected from their names.

Now let’s examine the key factors (alcohol, chlorides, and volatile.acidity) and their relationship to quality. We will also examine the difference between red and white wine. Here I will include one more factor I am interested in: residual sugar.

Alcohol has a positive relationship with quality, while higher chlorides and volatile.acidity are associated with lower quality.

Compared with red wine, white wine has less volatile.acidity and chlorides, more sugar, and a slightly higher alcohol content.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Density has a negative relationship with alcohol and a positive correlation with residual sugar. The correlation coefficients are -0.687 and 0.553 respectively.
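Coefficients like these are Pearson correlations, which base R computes with cor(). A minimal sketch on made-up values, chosen only so that density falls with alcohol and rises with sugar (it does not reproduce the figures quoted above):

```r
# Made-up values, for illustration only.
demo <- data.frame(
  alcohol        = c(9.0, 10.0, 11.0, 12.0, 13.0),
  residual.sugar = c(12.0, 8.0, 6.0, 3.0, 1.5),
  density        = c(0.9990, 0.9965, 0.9950, 0.9925, 0.9905)
)

# Pearson correlation is the default method of cor().
r_alcohol <- cor(demo$density, demo$alcohol)
r_sugar   <- cor(demo$density, demo$residual.sugar)

round(c(alcohol = r_alcohol, sugar = r_sugar), 3)
```

Calling cor() on a whole numeric data frame returns the full correlation matrix, which is convenient for scanning a group of variables at once.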

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  1. White wine tends to have more alcohol, more residual sugar, and less acid and chlorides.

  2. As assumed in section 1, there are some intrinsic relationships between the variables. For example, free.sulfur.dioxide and total.sulfur.dioxide are highly correlated, and pH has a negative relationship with the acids.

What was the strongest relationship you found?

The strongest relationship is between density and alcohol (R = -0.687), which makes sense because alcohol is less dense than water (49.3 lb/ft^3 versus 62.4 lb/ft^3).
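The two densities can be checked with a short unit conversion. The g/cm^3 figures below are approximate textbook values for ethanol and water and are assumptions of this sketch, not numbers taken from the data set:

```r
# Approximate densities in g/cm^3 (assumed textbook values).
ethanol_g_cm3 <- 0.789
water_g_cm3   <- 1.000

# 1 g/cm^3 is about 62.428 lb/ft^3.
lb_ft3_per_g_cm3 <- 62.428

densities_lb_ft3 <- round(
  c(ethanol = ethanol_g_cm3, water = water_g_cm3) * lb_ft3_per_g_cm3,
  1
)
densities_lb_ft3   # ethanol 49.3, water 62.4
```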

Multivariate Plots Section

In both red and white wine, alcohol is positively related to quality.

In both red and white wine, volatile.acidity is negatively related to quality. However, red wine is more sensitive, while the relationship for white wine is relatively weak.

In both red and white wine, chlorides are negatively related to quality, although red wine has a higher chloride content at every rating level.

In the last section, we have examined the relationship between residual sugar and quality. White wine has a slightly negative relationship while red wine has a positive relationship.

It can be inferred that a high-quality red wine is expected to be more “sweet”, while a high-quality white wine is expected to be less “sweet”.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this section, I found that the relationships of quality to alcohol, chlorides, and volatile.acidity differ between red and white wine.
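One way to quantify such a difference is to compute the correlation separately within each color. A minimal sketch with split() and sapply() on made-up values (the data frame `demo` is illustrative and built so that quality falls clearly with volatile acidity for red but only weakly for white):

```r
# Made-up data, for illustration only.
demo <- data.frame(
  color            = rep(c("Red", "White"), each = 5),
  volatile.acidity = c(0.30, 0.45, 0.60, 0.75, 0.90,
                       0.20, 0.25, 0.30, 0.35, 0.40),
  quality          = c(7, 6, 6, 5, 4,
                       6, 5, 6, 6, 5)
)

# Split the rows by color and compute the correlation within each group.
r_by_color <- sapply(
  split(demo, demo$color),
  function(d) cor(d$volatile.acidity, d$quality)
)

round(r_by_color, 3)
```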

Were there any interesting or surprising interactions between features?

The standards used to judge the quality of red wine and white wine are different. For red wine, the residual sugar has a positive relationship with the quality. However, for white wine, it is negatively related to the quality.

volatile.acidity has a negative effect in red wine, but white wine is not sensitive to it.

Both wines show a trend with alcohol and chlorides.


Final Plots and Summary

Plot One: Quality Distribution of Red & White Wine

This figure shows the distribution of wine ratings. Among the 1599 red wine samples and 4898 white wine samples, most were rated 5 or 6 (2000+ and 2800+ samples respectively). The quality ratings follow a roughly normal, slightly skewed distribution. The lowest rating is 3 and the highest is 9. Among the samples rated under 6, red wine makes up about one third. Among the high-rated samples, however, red wine makes up a much smaller portion.
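Counts and shares of this kind can be tabulated with table() and prop.table(). A minimal sketch on made-up ratings (the data frame `demo` is illustrative and does not reproduce the counts above):

```r
# Made-up ratings, for illustration only.
demo <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7, 7),
  color   = c("Red", "White", "White", "Red", "White", "White",
              "White", "White", "White", "Red")
)

# Rows are ratings, columns are colors.
counts <- table(demo$quality, demo$color)
counts

# Share of each color within every rating (each row sums to 1).
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```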

Plot Two: Difference between Red & White Wine

This plot depicts the differences between red and white wine. Red wine has more acids, more chlorides, less sugar and slightly less alcohol. The greatest difference in the figure is in volatile acidity: red wine averages 0.5 g/L while white wine averages only 0.2 g/L. Every group of data has a few significant outliers.
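Group averages like these can be computed with aggregate(). A minimal sketch on made-up values (the data frame `demo` is illustrative; its numbers are chosen so the group means are easy to verify by hand):

```r
# Made-up values, for illustration only.
demo <- data.frame(
  color            = c("Red", "Red", "Red", "White", "White", "White"),
  volatile.acidity = c(0.45, 0.50, 0.55, 0.15, 0.20, 0.25),
  residual.sugar   = c(2.0, 2.5, 3.0, 5.0, 6.0, 7.0)
)

# Mean of each listed variable within each color.
means_by_color <- aggregate(
  cbind(volatile.acidity, residual.sugar) ~ color,
  data = demo,
  FUN  = mean
)
means_by_color
```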

Plot Three: Volatile Acidity vs. Wine Quality

volatile.acidity has a high impact on quality. This figure depicts how red and white wine behave differently with respect to volatile.acidity content. For red wine, the average volatile.acidity is higher than for white wine, and quality is more sensitive to changes in volatile.acidity. Overall, volatile.acidity has a negative relationship with wine quality.


Reflection

“The biggest difference between reds and whites is in how they’re made. The grapes used for red and white wines generally look very different—as you might imagine, red wine grapes are darker and have more pigment. When making white wine, typically the grapes are pressed and then just the juice is fermented.”1

The nature of the grapes and the winemaking process make the telling difference. Using the data, we looked into the differences between red and white wine in terms of their chemical contents. Compared to red wine, white wine tends to have higher alcohol, more residual sugar, less acid and less chlorides (probably because of the winemaking process).

Some factors affecting quality also differed between red and white wine. Residual sugar and acids contributed positively to the quality of red wine but lowered the rating of white wine. Sulphates positively influenced red wine quality, while white wine seems to be insensitive to them. For both wines, higher alcohol content tended to be rated as better quality.

After all, a quality rating is a relatively subjective measure. Human beings, even experts, are limited in their ability to distinguish the tiny differences between samples, not to mention consumers. That is probably why most wines were rated 5 or 6. If more extreme cases (below 3 or greater than 8) could be gathered, I would be interested to see why those samples stand out.

References:

  1. http://www.winespectator.com/drvinny/show/id/44697

  2. https://en.wikipedia.org/wiki/Sweetness_of_wine